{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas-统计计算和描述"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 内容介绍:pandas具有一组常用的统计方法\n",
    "* 官网中关于基本函数的介绍:https://pandas.pydata.org/docs/user_guide/basics.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "d    0\n",
      "b    1\n",
      "c    2\n",
      "a    3\n",
      "e    4\n",
      "dtype: int64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>47</td>\n",
       "      <td>82</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>46</td>\n",
       "      <td>58</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>52</td>\n",
       "      <td>65</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>83</td>\n",
       "      <td>45</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    B   A   C\n",
       "d  47  82  45\n",
       "b  46  58  76\n",
       "c  52  65  56\n",
       "a  83  45  93"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 示例数据\n",
    "s0 = pd.Series(range(5),index=['d','b','c','a','e'])\n",
    "print(s0)\n",
    "df0 = pd.DataFrame(np.random.randint(40,100,size=(4,3)),index=['d','b','c','a'],columns=['B','A','C'])\n",
    "df0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>47.0</td>\n",
       "      <td>82.0</td>\n",
       "      <td>45.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>46.0</td>\n",
       "      <td>58.0</td>\n",
       "      <td>76.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>83.0</td>\n",
       "      <td>45.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      B     A     C\n",
       "d  47.0  82.0  45.0\n",
       "b  46.0  58.0  76.0\n",
       "c   NaN   NaN   NaN\n",
       "a  83.0  45.0   NaN"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df0.loc['a','C']=None\n",
    "df0.loc['c','A']=None\n",
    "df0.loc['c']=None\n",
    "df0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.求和运算及其他统计基本运算\n",
    "\n",
    "主要的参数如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "|序号|参数|说明|\n",
    "|----|----|----|\n",
    "|01|axis|约简的轴。DataFrame的行用0,列用1|\n",
    "|02|skipna|排除缺失值，默认值为True，即计算时排除缺失值|\n",
    "|03|level|如果轴是层次化索引的(即MultiIndex)，则根据level分组简约|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "B    228.0\n",
       "A    185.0\n",
       "C    177.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#sum()函数默认求dataframe列的和\n",
    "df0.sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    174.0\n",
       "b    180.0\n",
       "c      0.0\n",
       "a    128.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#如果求行的和，可以设置axis参数为1\n",
    "df0.sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    174.0\n",
       "b    180.0\n",
       "c      NaN\n",
       "a      NaN\n",
       "dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#涉及缺失值的情况，可以使用skipna参数\n",
    "# skipna参数，默认为Ture即排除缺失值。当为False时，不排除缺失值，那么求和出现了NaN\n",
    "df0.sum(axis=1,skipna=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "B    a\n",
       "A    d\n",
       "C    b\n",
       "dtype: object"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#inxmax返回最大值的索引\n",
    "df0.idxmax()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>47.0</td>\n",
       "      <td>82.0</td>\n",
       "      <td>45.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>93.0</td>\n",
       "      <td>140.0</td>\n",
       "      <td>121.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>176.0</td>\n",
       "      <td>185.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       B      A      C\n",
       "d   47.0   82.0   45.0\n",
       "b   93.0  140.0  121.0\n",
       "c    NaN    NaN    NaN\n",
       "a  176.0  185.0    NaN"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#cumsum()样本值的累计求和公式\n",
    "df0.cumsum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.综合统计结果describe\n",
    "\n",
    "使用比较广泛，获取数据后可以先使用此方法查看数据的综合情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>B</th>\n",
       "      <th>A</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>2.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>58.666667</td>\n",
       "      <td>61.666667</td>\n",
       "      <td>60.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>21.079216</td>\n",
       "      <td>18.770544</td>\n",
       "      <td>21.92031</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>46.000000</td>\n",
       "      <td>45.000000</td>\n",
       "      <td>45.00000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>46.500000</td>\n",
       "      <td>51.500000</td>\n",
       "      <td>52.75000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>47.000000</td>\n",
       "      <td>58.000000</td>\n",
       "      <td>60.50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>65.000000</td>\n",
       "      <td>70.000000</td>\n",
       "      <td>68.25000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>83.000000</td>\n",
       "      <td>82.000000</td>\n",
       "      <td>76.00000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               B          A         C\n",
       "count   3.000000   3.000000   2.00000\n",
       "mean   58.666667  61.666667  60.50000\n",
       "std    21.079216  18.770544  21.92031\n",
       "min    46.000000  45.000000  45.00000\n",
       "25%    46.500000  51.500000  52.75000\n",
       "50%    47.000000  58.000000  60.50000\n",
       "75%    65.000000  70.000000  68.25000\n",
       "max    83.000000  82.000000  76.00000"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 常用的数据综合统计表示方法\n",
    "df0.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     a\n",
       "1     d\n",
       "2     g\n",
       "3     e\n",
       "4     f\n",
       "5     a\n",
       "6     d\n",
       "7     g\n",
       "8     e\n",
       "9     f\n",
       "10    a\n",
       "11    d\n",
       "12    g\n",
       "13    e\n",
       "14    f\n",
       "15    a\n",
       "16    d\n",
       "17    g\n",
       "18    e\n",
       "19    f\n",
       "dtype: object"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s2 = pd.Series(['a','d','g','e','f']*4)\n",
    "s2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     20\n",
       "unique     5\n",
       "top        a\n",
       "freq       4\n",
       "dtype: object"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# describe可以根据数据的情况自动选择显示结果\n",
    "s2.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    5.000000\n",
       "mean     2.000000\n",
       "std      1.581139\n",
       "min      0.000000\n",
       "25%      1.000000\n",
       "50%      2.000000\n",
       "75%      3.000000\n",
       "max      4.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# describe可以根据数据的情况自动选择显示结果\n",
    "s0.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.主要函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|NO.|Function|Description|说明(中文)|\n",
    "|----|----|----|----|\n",
    "|01|count|Number of non-NA observations|非空数值的统计|\n",
    "|02|sum|Sum of values|求和|\n",
    "|03|mean|Mean of values|求平均值|\n",
    "|04|mad|Mean absolute deviation|返回所请求轴的值的平均绝对偏差|\n",
    "|05|median|Arithmetic median of values|求中值|\n",
    "|06|min|Minimum|求最小值|\n",
    "|07|max|Maximum|求最大值|\n",
    "|08|mode|Mode|获取沿所选轴的每个元素的模式|\n",
    "|09|abs|Absolute Value|求绝对值|\n",
    "|10|prod|Product of values|返回所请求轴的值的乘积。|\n",
    "|11|std|Bessel-corrected sample standard deviation|求标准差|\n",
    "|12|var|Unbiased variance|求方差|\n",
    "|13|sem|Standard error of the mean|返回所请求轴上的平均值的无偏标准误差|\n",
    "|14|skew|Sample skewness (3rd moment)||\n",
    "|15|kurt|Sample kurtosis (4th moment)|使用Fisher的峰度定义（正常的峰度== 0.0）在请求的轴上返回无偏峰度。|\n",
    "|16|quantile|Sample quantile (value at %)||\n",
    "|17|cumsum|Cumulative sum|返回数据轴上累计求和|\n",
    "|18|cumprod|Cumulative product|返回数据轴上累计乘积|\n",
    "|19|cummax|Cumulative maximum|返回数据轴上累计最大值|\n",
    "|20|cummin|Cumulative minimum|返回数据轴上累计最小值|"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
